Is Similarity Search Useful for High Dimensional Spaces?
Abstract
Extended Abstract
In recent years, content-based multimedia retrieval has become an important research problem. Providing effective and efficient access to relevant data stored in large (often distributed) digital repositories requires advanced software tools. Content-based retrieval abstracts the contents of an object, for example color or shape in the case of images, into so-called features, which are typically points in a high-dimensional vector space. Instead of determining the similarity of two objects from their raw data, only the much smaller feature representations are used to estimate it. Given a reference (query) object represented by its features, similarity predicates are defined to retrieve either a specific number of best matches or all objects satisfying a (distance) constraint. Accordingly, we distinguish between nearest neighbor (NN) queries and similarity range queries.
Usefulness of NN-search. The dimensionality of a single feature type may range from small (4-9) to large (several hundred). Frequently, it is not sufficient to query one feature in isolation; rather, a query typically combines several feature types to better reflect the notion of similarity. As a result, the number of dimensions of the combined features is well above 10. Beyer et al. [2] have recently questioned NN-search in high-dimensional vector spaces. Their theoretical analysis of a uniformly distributed data set concludes that NN-search as an implementation of similarity search is questionable if the number of dimensions is large, e.g. above 16. On the other hand, experiments with feature data from a large image database have shown that this finding need not always hold for real data. It remains an open question how to determine whether the result retrieved for a query is of satisfactory quality for the user.
Efficiency of NN-search. The feature approach to similarity search typically reduces the amount of searched data by orders of magnitude, e.g. from the terabyte range to the gigabyte range. Since this reduction in data volume is not enough for large collections, researchers have proposed a number of methods, most of which are based on data-space partitioning. Index trees such as the R-tree [5], X-tree [1], SR-tree [6], M-tree [3] or TV-tree [7] divide the data space according to the distribution of the data objects inserted or loaded into the tree. The main objective is to prune the search space in such a way that the NN can be …
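The two query types named in the abstract, and the concern raised by Beyer et al. [2], can be illustrated with a minimal Python sketch. It is not the authors' method: it uses a brute-force linear scan with Euclidean distance rather than any index tree, and the function names (`range_query`, `knn_query`, `relative_contrast`, `uniform_points`) as well as the choice of "relative contrast" as the measure of how meaningful the nearest neighbor is are assumptions made for illustration only.

```python
import math
import random


def range_query(points, query, radius):
    """Similarity range query: all points within `radius` of `query`."""
    return [p for p in points if math.dist(p, query) <= radius]


def knn_query(points, query, k):
    """Nearest neighbor query: the k points closest to `query` (linear scan)."""
    return sorted(points, key=lambda p: math.dist(p, query))[:k]


def relative_contrast(points, query):
    """(d_max - d_min) / d_min over the distances from `query` to `points`.

    Illustrative measure (an assumption, not taken from the abstract):
    values near 0 mean the nearest and farthest points are almost equally
    far away. Assumes no point coincides with `query`, so d_min > 0.
    """
    dists = [math.dist(p, query) for p in points]
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


def uniform_points(n, dim, rng):
    """n points drawn uniformly at random from the unit hypercube [0, 1)^dim."""
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs are comparable
    for dim in (2, 16, 128):
        data = uniform_points(2000, dim, rng)
        query = [rng.random() for _ in range(dim)]
        print(dim, relative_contrast(data, query))
```

On uniformly distributed data, the printed contrast should shrink as the number of dimensions grows, which is the sense in which the nearest neighbor becomes hard to distinguish from the rest; real feature data, as the abstract notes, need not behave this way.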
Similar resources
Efficiently Indexing High-Dimensional Data Spaces
Indexing high-dimensional data spaces is an emerging research domain. It gains increasing importance by the need to support modern applications by powerful search tools. In the so-called non-standard applications of database systems such as multimedia, CAD, molecular biology, medical imaging, time series processing and many others, similarity search in large data sets is required as a basic fun...
High-Dimensional Simplexes for Supermetric Search
In 1953, Blumenthal showed that every semi-metric space that is isometrically embeddable in a Hilbert space has the n-point property; we have previously called such spaces supermetric spaces. Although this is a strictly stronger property than triangle inequality, it is nonetheless closely related and many useful metric spaces possess it. These include Euclidean, Cosine and Jensen-Shannon spaces...
Retrieval of Optimal Subspace Clusters Set for an Effective Similarity Search in a High-Dimensional Spaces
High dimensional data is often analysed resorting to its distribution properties in subspaces. Subspace clustering is a powerful method for elicitation of high dimensional data features. The result of subspace clustering can be an essential base for building indexing structures and further data search. However, a high number of subspaces and data instances can conceal a high number of subspace...
The Theory and Practice of Similarity Searches in High Dimensional Data Spaces
Similarity search in multimedia databases is typically performed on abstractions of multimedia objects, also called the features, rather than on the objects themselves. Though the feature extraction process is application specific, the resulting features are most often considered as points in high-dimensional vector spaces (e.g. the color indexing method of Stricker and Orengo [SO95]). Similarit...
Similarity Search in High-Dimensional Data Spaces
This paper summarizes analytical and experimental results for the nearest neighbor similarity search problem in high-dimensional vector spaces using some kind of space- or data-partitioning scheme. Under the assumptions of uniformity and independence of data, we are able to formally show and to demonstrate that conventional approaches to the nearest neighbor problem degenerate if the dimensional...
SPY-TEC: An efficient indexing method for similarity search in high-dimensional data spaces
Most index structures based on the R-tree have failed to support efficient indexing mechanisms for similarity search in high-dimensional data spaces. This is due to the fact that most of the index structures commonly use balanced split strategy in order to guarantee storage utilization and the shape of queries for similarity search is a hypersphere in high-dimensional spaces. In this paper...